fix(daemon): sweep full workspace triple on startup and crash exit#239
Conversation
The daemon orphaned three per-job paths under CLONE_BASE_DIR on SIGKILL
(OOM/eviction/kubelet kill): the clone dir, the token-bearing
${workDir}.cred.sh, and ${workDir}-artifacts. registerExitCleanup and
handleJobCancel removed only the first two even on graceful exit, and the
daemon had no startup sweep (the only reaper lived in app.ts, ran in the
wrong process, and matched *.cred.sh only).
Add src/core/workspace-sweep.ts: removeWorkspaceTripleSync (sync, for
exit/cancel handlers) and sweepStaleWorkspaces (startup TTL reaper, emits
a workspace.sweep log). Wire into both job-executor cleanup sites, daemon
main() startup, and app.ts (replacing its divergent cred-only sweep). New
WORKSPACE_STALE_TTL_MS env (default 1h).
Closes #221
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015v2bspPF1ZTTG9ZZrbBgA1
|
Warning Review limit reached
More reviews will be available in 52 minutes and 35 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (9)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR fixes daemon workspace leaks by ensuring all three per-job paths under CLONE_BASE_DIR (clone dir, ${workDir}.cred.sh, and ${workDir}-artifacts) are cleaned up on cancel/exit and are swept on process startup via a TTL-based reaper.
Changes:
- Added
src/core/workspace-sweep.tswithremoveWorkspaceTripleSync()(exit/cancel) andsweepStaleWorkspaces()(startup TTL sweep + structured log). - Updated daemon and webhook startup to run the shared startup sweep, and updated daemon cancel/exit handlers to remove the full workspace triple.
- Added unit tests and updated docs/config for the new
WORKSPACE_STALE_TTL_MSknob and theworkspace.sweeplog event.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/core/workspace-sweep.ts |
Introduces shared helpers to remove full workspace triples and sweep stale entries with TTL + structured logging. |
src/daemon/job-executor.ts |
Replaces ad-hoc rmSync cleanup with removeWorkspaceTripleSync for exit and cancel paths. |
src/daemon/main.ts |
Runs sweepStaleWorkspaces at daemon boot before capability discovery / WS connect. |
src/app.ts |
Replaces the previous *.cred.sh-only reaper with the shared sweepStaleWorkspaces. |
src/config.ts |
Adds validated workspaceStaleTtlMs sourced from WORKSPACE_STALE_TTL_MS. |
test/core/workspace-sweep.test.ts |
Adds unit coverage for triple removal, TTL sweep behavior, and logging contract. |
docs/operate/configuration.md |
Documents WORKSPACE_STALE_TTL_MS. |
docs/operate/observability.md |
Documents the workspace.sweep log event and fields. |
docs/operate/runbooks/daemon-fleet.md |
Notes daemon startup sweep behavior and operational expectations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * RED phase: the module under test does not exist yet, so the import below | ||
| * fails at resolution time and every case errors. Production code is written | ||
| * in the GREEN phase. |
| } catch { | ||
| // Base dir unreadable / uncreatable: nothing to sweep. | ||
| return { swept: 0, retained: 0, durationMs: Date.now() - startedAt }; | ||
| } |
Summary
Fixes a filesystem + token-leak bug (#221) where three per-job paths under
CLONE_BASE_DIRwere orphaned after a pod SIGKILL, OOM kill, or spot eviction: the clone directory<workDir>, the installation-token-bearing credential helper<workDir>.cred.sh, and the sibling summary directory<workDir>-artifacts. Two independent gaps combined:registerExitCleanupandhandleJobCancelinjob-executor.tsremoved only the clone dir + credential helper (the-artifactsdir was always leaked, even on graceful exit), and the daemon had no startup sweep at all — the only reaper lived inapp.ts, ran in the webhook/orchestrator process (wrong filesystem in production), and was scoped to*.cred.shfiles only.Flow
flowchart TD Kill["SIGKILL or OOM or eviction<br/>also graceful exit and cancel"]:::trig Old["Before<br/>exit and cancel removed clone dir and cred.sh only<br/>app.ts reaper removed cred.sh only"]:::old Leak["Orphaned<br/>-artifacts dir always<br/>clone dir on daemon SIGKILL"]:::old New["After<br/>removeWorkspaceTripleSync removes all three<br/>on exit and cancel"]:::new Sweep["sweepStaleWorkspaces at startup<br/>reaps entries older than WORKSPACE_STALE_TTL_MS<br/>emits a workspace.sweep log line"]:::new Clean["No orphans accumulate"]:::keep Kill --> Old --> Leak Kill --> New --> Clean Sweep --> Clean classDef trig fill:#1f618d,color:#ffffff classDef old fill:#7b2d2d,color:#ffffff classDef new fill:#2c5f2e,color:#ffffff classDef keep fill:#2c3e50,color:#ffffffChanges
src/core/workspace-sweep.ts(parameter-based, config-free):removeWorkspaceTripleSync(workDir)— synchronous best-effort removal of the full triple, forprocess.on("exit")/ cancel handlers; self-guards the empty-workDirscoped-job invariant so no removal can target a CWD-relative.cred.sh/-artifacts.sweepStaleWorkspaces(cloneBaseDir, ttlMs, log)— async TTL reaper; tolerant of a missing base dir and concurrent removal; emits one structuredworkspace.sweeplog line with{swept, retained, durationMs}.src/daemon/job-executor.ts— bothregisterExitCleanupandhandleJobCancelnow callremoveWorkspaceTripleSync(the-artifactssibling is now reaped on crash exit and cancel). Scoped-job guards (workDir === "") preserved.src/daemon/main.ts—sweepStaleWorkspacesruns once at boot, beforediscoverCapabilitiesand the WebSocket connect, so no in-flight job of this lifetime can be reaped.src/app.ts— the inline*.cred.sh-only reaper is replaced by the shared helper (now also sweeps clone dirs + artifacts); deadSTALE_CRED_TTL_MSconstant and now-unused imports removed.src/config.ts— newworkspaceStaleTtlMsfromWORKSPACE_STALE_TTL_MS(default3600000= 1h, positive-int validated).configuration.md(env row),observability.md(workspace.sweepevent),runbooks/daemon-fleet.md(startup sweep).test/core/workspace-sweep.test.ts: triple removal, absent-path no-op, empty-workDirno-op, aged-swept/fresh-retained (exact counts +workspace.sweeplog assertion), partial-orphan stale-cred.shindependence, missing-base tolerance.Why it's safe
cloneBaseDiris pod-local ephemeral storage (no shared volume in the ephemeral-daemon Pod spec) holding only ephemeral per-job workspaces — every jobmkdtemps fresh; there is no persistent repo cache. The sweep runs at startup before any job is accepted, so the mtime-based per-entry reap cannot touch a live workspace.Related issues
Closes #221
Test plan
test/core/workspace-sweep.test.ts— 6 cases, all pass in isolationdocs:build(mkdocs strict + doc-sync + citations) cleanbun test: stash-verified baseline shows this change adds passes with fail/error counts unchanged; remaining failures are pre-existing local DB/Valkey-absent suites. The isolated CI runner (scripts/test-isolated.sh) is the authoritative signal.